Abstract

Much recent progress in Vision-to-Language (V2L) problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we investigate whether this direct approach succeeds due to, or despite, the fact that it avoids the explicit representation of high-level information. We propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We also show that the same mechanism can be used to introduce external semantic information and that doing so further improves performance. We achieve the best reported results on both image captioning and VQA on several benchmark datasets, and provide an analysis of the value of explicit high-level concepts in V2L problems.
Figure 1. Our attribute based V2L framework. The image analysis module learns a mapping between an image and the semantic attributes through a CNN. The language module learns a mapping from the attributes vector to a sequence of words using an LSTM.


1. Introduction 

Vision-to-Language problems present a particular challenge in Computer Vision because they require translation between two different forms of information. In this sense the problem is similar to that of machine translation between languages. In machine language translation there have been a series of results showing that good performance can be achieved without developing a higher-level model of the state of the world. In [3, 7, 47], for instance, a source sentence is transformed into a fixed-length vector representation by an ‘encoder’ RNN, which in turn is used as the initial hidden state of a ‘decoder’ RNN that generates the target sentence.

Despite the supposed equivalence between an image and 1000 words, the manner in which information is represented in each data form could hardly be more different. Human language is designed specifically so as to communicate information between humans, whereas even the most carefully composed image is the culmination of a complex set of physical processes over which humans have little control. Given the differences between these two forms of information, it seems surprising that methods inspired by machine language translation have been so successful. These RNN-based methods, which translate directly from image features to text without developing a high-level model of the state of the world, represent the current state of the art for key Vision-to-Language (V2L) problems, such as image captioning and visual question answering.

This approach is reflected in many recent successful works on image captioning, such as [6, 10, 23, 36, 50, 55]. Current state-of-the-art captioning methods use a CNN as an image ‘encoder’ to produce a fixed-length vector representation [25, 29, 45, 48], which is then fed into the ‘decoder’ RNN to generate a caption.

Visual Question Answering (VQA) is a more recent challenge than image captioning. In this V2L problem an image and a free-form, open-ended question about the image are presented to the method, which is required to produce a suitable answer [2]. As in image captioning, the current state of the art in VQA [13, 35, 43] relies on passing CNN features to an RNN language model.

Our main contribution is to consider the question: what value do explicit high level concepts have in V2L problems? That is, given that significant performance improvements have been achieved by moving to models which directly pass from image features to text, should we give up on high-level concepts in V2L altogether? We investigate particularly the impact that adding high-level information to the CNN-RNN framework has upon performance. We do this by inserting an explicit representation of attributes of the scene which are meaningful to humans. Each semantic attribute corresponds to a word mined from the training image descriptions, and represents higher-level knowledge about the content of the image. A CNN-based classifier is trained for each attribute, and the set of attribute likelihoods for an image forms a high-level representation of image content. An RNN is then trained to generate captions, or answer questions, on the basis of the likelihoods.

Our second contribution is a fully trainable attribute based neural network that can be applied to multiple V2L problems and that yields significantly better performance than current state-of-the-art approaches. For example, in the Microsoft COCO Captioning Challenge, we produce a BLEU-1 score of 0.73, which is the state of the art on the leaderboard at the time of writing. Our final model also provides the state-of-the-art performance on several recently released VQA datasets. For instance, our system yields a WUPS@0.9 score of 71.15, compared with the current state of the art of 66.78, on the Toronto COCO-QA single word question answering dataset. On the VQA (test-standard), an open-answer task dataset, our method achieves 55.84% accuracy, while the baseline is 54.06%. Moreover, with an expansion from image-sourced attributes to knowledge-sourced attributes through WordNet (see Section 5.3), we further improve the accuracy to 57.62%.

2. Related Work 

Image Captioning. The problem of annotating images with natural language at the scene level has long been studied in both computer vision and natural language processing. Hodosh et al. [17] proposed to frame sentence-based image annotation as the task of ranking a given pool of captions. Similarly, [15, 19, 40] posed the task as a retrieval problem, but based on co-embedding of images and text in the same space. Recently, Socher et al. [46] used neural networks to co-embed image and sentences together and Karpathy et al. [23] co-embedded image crops and sub-sentences. Neither attempted to generate novel captions.

Attributes have been used in many image captioning methods to fill the gaps in predetermined caption templates. Farhadi et al. [12], for instance, used detections to infer a triplet of scene elements which is converted to text using a template. Li et al. [30] composed image descriptions given computer vision based inputs such as detected objects, modifiers and locations using web-scale n-grams. A more sophisticated CRF-based method that uses attribute detections beyond triplets was proposed by Kulkarni et al. [26]. The advantage of template-based methods is that the resulting captions are more likely to be grammatically correct. The drawback is that they still rely on hard-coded visual concepts and suffer the implied limits on the variety of the output. Instead of using fixed templates, more powerful language models based on language parsing have been developed, such as [1, 27, 28, 39].

Fang et al. [11] won the 2015 COCO Captioning Challenge with an approach that is similar to ours in as much as it applies a visual concept (i.e., attribute) detection process before generating sentences. They first learned 1000 independent detectors for visual words based on a multi-instance learning framework and then used a maximum entropy language model conditioned on the set of visually detected words directly to generate captions. In contrast, our visual attributes act as a high-level semantic representation for image content which is fed into an LSTM that generates target sentences based on a much larger word vocabulary. More importantly, the success of their model relies on a re-scoring process from a joint image-text embedding space. To what extent the high-level concepts help in image captioning (and other V2L tasks) is not discussed in their work. Instead, this is the main focus of this paper.

In contrast to the aforementioned two-stage methods, the recent dominant trend in V2L is to use an architecture which connects a CNN to an RNN to learn the mapping from images to sentences directly. Mao et al. [36], for instance, proposed a multimodal RNN (m-RNN) to estimate the probability distribution of the next word given previous words and the deep CNN feature of an image at each time step. Similarly, Kiros et al. [24] constructed a joint multimodal embedding space using a powerful deep CNN model and an LSTM that encodes text. Karpathy et al. [22] also proposed a multimodal RNN generative model, but in contrast to [36], their RNN is conditioned on the image information only at the first time step. Vinyals et al. [50] combined deep CNNs for image classification with an LSTM for sequence modeling, to create a single network that generates descriptions of images. Chen et al. [6] learned a bi-directional mapping between images and their sentence-based descriptions using an RNN. Xu et al. [53] and You et al. [56] proposed models based on visual attention. Jia et al. [18] applied additional retrieved sentences to guide the LSTM in generating captions. Devlin et al. [9] combined both a maximum entropy (ME) language model and an RNN to generate captions.


Interestingly, this end-to-end CNN-RNN approach ignores the image-to-word mapping which was an essential step in many of the previous image captioning systems detailed above [12, 26, 30, 54]. The CNN-RNN approach has the advantage that it is able to generate a wider variety of captions, can be trained end-to-end, and outperforms the previous approach on the benchmarks. It is not clear, however, what the impact of bypassing the intermediate high-level representation is, and particularly to what extent the RNN language model might be compensating. Donahue et al. [10] described an experiment, for example, using tags and CRF models as a mid-layer representation for video to generate descriptions, but it was designed to prove that an LSTM outperforms an SMT-based approach [44]. It remains unclear whether the mid-layer representation or the LSTM leads to the success. Our paper provides several well-designed experiments to answer this question.

We thus show here not only a method for introducing a high-level representation into the CNN-RNN framework, and that doing so improves performance, but we also investigate the value of high-level information more broadly in V2L tasks. This is of critical importance at this time because V2L has a long way to go, particularly in the generality of the images and text it is applicable to.

Visual Question Answering. Visual question answering is one of the more challenging, and interesting, V2L tasks as it requires answering previously unseen questions about image content [2, 13, 32, 33, 34, 35, 43, 59]. This is as opposed to the vast majority of challenges in Computer Vision in which the question is specified long before the program is written. Both Gao et al. [13] and Malinowski et al. [35] used RNNs to encode the question and output the answer. Ren et al. [43] focused on questions with a single-word answer and formulated the task as a classification problem using an LSTM, and released a single-word answer dataset (Toronto COCO-QA). Ma et al. [32] used CNNs to extract both image features and sentence features, and fused the features together with a multi-modal CNN. Antol et al. [2] proposed a large-scale open-ended VQA dataset based on COCO, which is called VQA. They also provided several baseline methods which combined both image features (CNN extracted) and question features (LSTM extracted) to obtain a single embedding and further built an MLP (Multi-Layer Perceptron) to obtain a distribution over answers.

3. An Attribute-based V2L Model 

Our approach is summarized in Figure 1. The model includes an image analysis part and a language generation part. In the image analysis part, we first use supervised learning to predict a set of attributes, based on words commonly found in image captions. We solve this as a multi-label classification problem and train a corresponding deep CNN by minimizing an element-wise logistic loss function. Secondly, a fixed length vector $V_{att}(I)$ is created for each image $I$, whose length is the size of the attribute set. Each dimension of the vector contains the prediction probability for a particular attribute. In the language generation part, we apply an LSTM-based sentence generator. Our attribute vector $V_{att}(I)$ is used as an input to this LSTM. For different tasks, we have different language models. For image captioning, we follow [50] to generate sentences from an LSTM; for single-word question answering, as in [43], we use the LSTM as a classifier providing a likelihood for each potential answer; for open-ended question answering, we use an encoder LSTM to encode questions while the second LSTM decoder uses the attribute vector $V_{att}(I)$ to generate a sentence based answer. A baseline model is also implemented for each of the three tasks. In the baseline model, as in [13, 43, 50], we use a pre-trained CNN to extract image features CNN($I$) which are fed into the LSTM directly. For the sake of completeness a fine-tuned version of this approach is also implemented. The baseline method is used as a counterpart to verify the effectiveness of the intermediate attribute prediction layer for each task.

3.1. The Attribute Predictor 

We first build an attributes vocabulary regardless of the final tasks (i.e. image captioning, VQA). Unlike [26, 54], which use a vocabulary from separate hand-labeled training data, our semantic attributes are extracted from training captions and can be any part of speech, including object names (nouns), motions (verbs) or properties (adjectives). The direct use of captions guarantees that the most salient attributes for an image set are extracted. We use the $c$ most common words in the training captions to determine the attribute vocabulary. In contrast to [11], our vocabulary is not tense or plurality sensitive (done manually); for instance, 'ride' and 'riding' are classified as the same semantic attribute, and similarly 'bag' and 'bags'. This significantly decreases the size of our attribute vocabulary. We finally obtain a vocabulary with 256 attributes. Our attributes represent a set of high-level semantic constructs, the totality of which the LSTM then attempts to represent in sentence form. Generating a sentence from a vector of attribute likelihoods exploits a much larger set of candidate words which are learned separately (see Section 3.2 for more details).
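The vocabulary construction described above can be illustrated with a minimal sketch. It is not the authors' code: the merge table `MANUAL_MERGE` stands in for the manual tense and plurality merging, and the stop-word list `STOP_WORDS` is an assumption, since the text does not say how function words are handled. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical hand-written table standing in for the manual merging of
# tense and plural forms ('ride'/'riding', 'bag'/'bags').
MANUAL_MERGE = {"riding": "ride", "rides": "ride", "bags": "bag"}

# Assumed stop-word list; the source does not describe this step.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "is", "are"}


def normalize(token):
    """Lower-case a token, strip punctuation, and apply the manual merge."""
    word = token.lower().strip(".,!?")
    return MANUAL_MERGE.get(word, word)


def build_attribute_vocabulary(captions, c):
    """Return the c most common normalized words over all training captions."""
    counts = Counter()
    for caption in captions:
        for token in caption.split():
            word = normalize(token)
            if word and word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(c)]


def attribute_labels(image_captions, vocabulary):
    """Binary label vector: 1 if any caption of the image mentions the attribute."""
    present = {normalize(t) for caption in image_captions for t in caption.split()}
    return [1 if attribute in present else 0 for attribute in vocabulary]
```

Calling `attribute_labels` on the captions of one image yields the multi-label target that the attribute predictor is trained against.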

Given this attribute vocabulary, we can associate each image with a set of attributes according to its captions. We then wish to predict the attributes given a test image. Because we do not have ground truth bounding boxes for attributes, we cannot train a detector for each using the standard approach. Fang et al. [11] solved a similar problem using a Multiple Instance Learning framework [58] to detect visual words from images. Motivated by the relatively small number of times that each word appears in a caption, we instead treat this as a multi-label classification problem.



Figure 2. Attribute prediction CNN: the model is initialized from VggNet [45] pre-trained on ImageNet. The model is then fine-tuned on the target multi-label dataset. Given a test image, a set of proposal regions are selected and passed to the shared CNN, and finally the CNN outputs from different proposals are aggregated with max pooling to produce the final multi-label prediction, which gives us the high-level image representation, $V_{att}(I)$.


To address the concern that some attributes may only apply to image sub-regions, we follow Wei et al. [51] in designing a region-based multi-label classification framework.

Figure 2 summarizes the attribute prediction network. In contrast to [51], which uses AlexNet [25] as the initialization of the shared CNN, we use the more powerful VggNet [45] pre-trained on ImageNet [8]. This model has been widely used in image captioning tasks [6, 11, 22, 36]. The shared CNN is then fine-tuned on the target multi-label dataset (our image-attribute training data). In this step, the output of the last fully-connected layer is fed into a $c$-way softmax. Here $c = 256$ is the attribute vocabulary size. In contrast to [51], which employs the squared loss, we find that an element-wise logistic loss function performs better. Suppose that there are $N$ training examples and $y_i = [y_{i1}, y_{i2}, \dots, y_{ic}]$ is the label vector of the $i$-th image, where $y_{ij} = 1$ if the image is annotated with attribute $j$, and $y_{ij} = 0$ otherwise. If the predictive probability vector is $p_i = [p_{i1}, p_{i2}, \dots, p_{ic}]$, then the cost function to be minimized is


$$J = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{c} \log\left(1 + \exp(-y_{ij} p_{ij})\right) \tag{1}$$
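As a rough illustration of Equation (1), the sketch below evaluates the element-wise logistic loss exactly as it is written, over plain Python lists. The function name is hypothetical, and averaging over the $N$ images is an assumption about the normalization, which is hard to read in the source.

```python
import math


def elementwise_logistic_loss(labels, scores):
    """Sum of log(1 + exp(-y_ij * p_ij)) over attributes, averaged over images.

    labels: N lists of c values y_ij (1 if attribute j is annotated, else 0).
    scores: N lists of c predicted values p_ij.
    The 1/N averaging is an assumption, not confirmed by the source.
    """
    n = len(labels)
    total = 0.0
    for y_i, p_i in zip(labels, scores):
        for y_ij, p_ij in zip(y_i, p_i):
            total += math.log1p(math.exp(-y_ij * p_ij))
    return total / n
```

With the labels in {0, 1} as defined in the text, only annotated attributes make the term depend on the prediction; a term with $y_{ij} = 0$ contributes the constant $\log 2$.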


During the fine-tuning process, the parameters of the last fully connected layer (i.e. the attribute prediction layer) are initialized with a Xavier initialization [14]. The learning rates of ‘fc6’ and ‘fc7’ of the VggNet are initialized as 0.001 and that of the last fully connected layer is initialized as 0.01. All the other layers are fixed during training. We executed 40 epochs in total and decreased the learning rate to one tenth of the current rate for each layer after 10 epochs. The momentum is set to 0.9. The dropout rate is set to 0.5.

To predict attributes based on regions, we first extract hundreds of proposal windows from an image. However, considering the computational inefficiency of deep CNNs, the number of proposals processed needs to be small. Similar to [51], we first apply the normalized cut algorithm to group the proposal bounding boxes into $m$ clusters based on the IoU scores matrix. The top $k$ hypotheses in terms of the predictive scores reported by the proposal generation algorithm are kept and fed into the shared CNN. In contrast to [51], we also include the whole image in the hypothesis group. As a result, there are $mk + 1$ hypotheses for each image. We set $m = 10$, $k = 5$ in all experiments. We use Multiscale Combinatorial Grouping (MCG) [42] for the proposal generation. Finally, a cross hypothesis max-pooling is applied to integrate the outputs into a single prediction vector $V_{att}(I)$.
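The final aggregation step can be sketched in a few lines. This is a minimal illustration, assuming each hypothesis has already been scored by the shared CNN and is available as a list of $c$ attribute scores; the function names are hypothetical.

```python
def num_hypotheses(m, k):
    """m clusters with the top k proposals each, plus the whole image."""
    return m * k + 1


def cross_hypothesis_max_pool(hypothesis_scores):
    """Element-wise maximum over the per-hypothesis attribute score vectors.

    hypothesis_scores: list of equal-length score lists, one per hypothesis.
    Returns a single vector playing the role of V_att(I).
    """
    if not hypothesis_scores:
        raise ValueError("at least one hypothesis is required")
    return [max(column) for column in zip(*hypothesis_scores)]
```

Max pooling lets an attribute that is visible in only one sub-region still receive a high score in the image-level vector.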

3.2. Language Generator 

Similar to [22, 36, 50], we propose to train a language generation model by maximizing the probability of the correct description given the image. However, rather than using image features directly as is typically the case, we use the semantic attribute prediction probability $V_{att}(I)$ from the previous section as the input. Suppose that $\{S_1, \dots, S_L\}$ is a sequence of words. The log-likelihood of the words given their context words and the corresponding image can be written as:

$$\log p(S \mid V_{att}(I)) = \sum_{t=1}^{L} \log p(S_t \mid S_{1:t-1}, V_{att}(I)) \tag{2}$$

where $p(S_t \mid S_{1:t-1}, V_{att}(I))$ is the probability of generating the word $S_t$ given the attribute vector $V_{att}(I)$ and previous words $S_{1:t-1}$. We employ the LSTM [16], a particular form of RNN, to model this. See Figure 3 for different language generators designed for multiple V2L tasks.

Image Captioning Model. The LSTM model for image captioning is trained in an unrolled form. More formally, the LSTM takes the attributes vector $V_{att}(I)$ and a sequence of words $S = (S_0, \dots, S_{L+1})$, where $S_0$ is a special START word and $S_{L+1}$ is a special END token. Each word is represented as a one-hot vector $S_t$ of dimension equal to the size of the words dictionary. The words dictionaries are built based on words that occur at least 5 times in the training set, which leads to 8791 words on the MS COCO dataset. Note that this is different from the semantic attributes vocabulary $V_{att}$. The training procedure is as follows (see Figure 3 (a)): At time step $t = -1$, we set $x_{-1} = W_{ea} V_{att}(I)$ and $h_{initial} = \vec{0}$, where $W_{ea}$ is the learnable attributes embedding weights. The LSTM memory state is initialized to the range (−0.1, 0.1) with a uniform distribution. This gives us an initial LSTM hidden state $h_{-1}$ which can be used in the next time step. From $t = 0$ to $t = L$, we set $x_t = W_{es} S_t$ and the hidden state $h_{t-1}$ is given by the previous step, where $W_{es}$ is the learnable word embedding weights. The probability distribution $p_{t+1}$ over all words is then computed by the LSTM feed-forward process. Finally, on the last step when $S_{L+1}$ represents the last word, the target label is set to the END token.

Figure 3. Language generators for different types of tasks: (a) Image Captioning, (b) VQA-single word, (c) VQA-sentence. The red arrow indicates our attributes input $V_{att}(I)$ while the blue dashed arrow shows the baseline method input CNN($I$).
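The input schedule of the unrolled captioning LSTM can be sketched as follows. This only builds the sequence of inputs $x_{-1}, x_0, \dots, x_L$ from the attribute vector and the word indices; the LSTM cell itself is omitted. The helper names and the nested-list matrix layout are assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    """One-hot vector S_t for a word index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_lstm_inputs(w_ea, w_es, v_att, word_ids, dictionary_size):
    """Return [x_{-1}, x_0, ..., x_L] for one caption.

    w_ea: attribute embedding matrix (embedding size x attribute count).
    w_es: word embedding matrix (embedding size x dictionary size).
    word_ids: indices of S_0 (START) through S_L; END appears only as a target.
    """
    inputs = [matvec(w_ea, v_att)]  # x_{-1} = W_ea V_att(I)
    for word_id in word_ids:  # x_t = W_es S_t for t = 0..L
        inputs.append(matvec(w_es, one_hot(word_id, dictionary_size)))
    return inputs
```

Because $S_t$ is one-hot, `matvec(w_es, one_hot(...))` simply selects a column of the embedding matrix; a practical implementation would index the column directly.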

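Our training objective is to learn parameters $W_{ea}$, $W_{es}$ and all parameters in the LSTM by minimizing the following cost function:

$$C = -\frac{1}{N} \sum_{i=1}^{N} \log p(S^{(i)} \mid V_{att}(I^{(i)})) + \lambda_\theta \cdot \|\theta\|_2^2 \tag{3}$$

$$= -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{L^{(i)}+1} \log p_t(S_t^{(i)}) + \lambda_\theta \cdot \|\theta\|_2^2 \tag{4}$$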

where $N$ is the number of training examples and $L^{(i)}$ is the length of the sentence for the $i$-th training example. $p_t(S_t^{(i)})$ corresponds to the activation of the Softmax layer in the LSTM model for the $i$-th input, $\theta$ represents the model parameters, and $\lambda_\theta \cdot \|\theta\|_2^2$ is a regularization term. We use SGD with mini-batches of 100 image-sentence pairs. The attributes embedding size, word embedding size and hidden state size are all set to 256 in all the experiments. The learning rate is set to 0.001 and gradients are clipped at 5. The dropout rate is set to 0.5.
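To make the captioning cost concrete, the following sketch evaluates a negative log-likelihood averaged over examples plus an L2 penalty, given the softmax probabilities already assigned to the target words. It is an illustration under those assumptions rather than the training code; the function name, argument layout and the value passed for the regularization weight are hypothetical.

```python
import math


def caption_cost(target_probs, parameters, lam):
    """Average negative log-likelihood over examples plus lam * ||theta||_2^2.

    target_probs: one list per training caption, holding the probability the
        model gave to each target word (including the final END token).
    parameters: flat list of model parameter values (theta).
    lam: regularization weight (lambda_theta); its value is not given in the text.
    """
    n = len(target_probs)
    nll = -sum(math.log(p) for probs in target_probs for p in probs) / n
    l2 = sum(w * w for w in parameters)
    return nll + lam * l2
```

For example, two captions with target-word probabilities `[0.5, 0.5]` and `[0.25]` give an average negative log-likelihood of `2 * ln 2`, to which the penalty term is added.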

Question Answering Model. For question answering, a triplet $\{V_{att}(I), \{Q_1, \dots, Q_L\}, \{A_1, \dots, A_T\}\}$ is given, where $L$ and $T$ are the lengths of the question and answer, respectively. We define it to be a single-word answering problem when $T = 1$ and a sentence-based problem if $T > 1$.

For the single-word answering problem, the LSTM takes the attributes score vector $V_{att}(I)$ and a sequence of input words of the question $Q = (Q_1, \dots, Q_L)$. The feed-forward process is the same as image captioning, except that an END token is not required anymore. Instead, we use the word generated by the last word of the question as the predicted answer (see Figure 3 (b)). Hence, the cost function is $C = -\frac{1}{N} \sum_{i=1}^{N} \log p(A^{(i)}) + \lambda_\theta \cdot \|\theta\|_2^2$, where $N$ is the number of training examples. $\log p(A^{(i)})$ is the log-probability distribution over all candidate answers that is computed by the last LSTM cell, given the previous hidden state and the last word of the question $Q_L$.

For the sentence-based question answering, we have a question encoding LSTM and an answer decoding LSTM. However, different from Gao et al. [13], who use two separate LSTMs for question and answer, weights between our encoding and decoding LSTMs are shared. The information stored in the LSTM memory cells of the last word in the question is treated as the representation of the sentence, and its hidden state will be used as the initial state of the answering LSTM part. Moreover, different from [13, 35, 43], who use CNN features directly, we use our attributes representations $V_{att}(I)$ as the input for the decoding LSTM (see Figure 3 (c)). The cost function of sentence-based question answering is $C = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}+1} \log p_t(A_t^{(i)}) + \lambda_\theta \cdot \|\theta\|_2^2$, where $T^{(i)} + 1$ is the length of the answer plus one END token for the $i$-th training example. For training, the learning rate is set to 0.0005 and other parameters are the same as in the image captioning configuration.

4. Image Captioning 

4.1. Dataset 

There are several datasets which consist of images and sentences describing them in English. We mainly report results on the popular Microsoft COCO [31] dataset. Results on Flickr8k [17] and Flickr30k [57] can be found in the supplementary material. MS COCO contains 123,287 images, and each image is annotated with 5 sentences. Because most previous work in image captioning [10, 11, 22, 36, 50, 53] is not evaluated on the official test split of MS COCO, for fair comparison, we report results with the widely used publicly available splits in the work of [22], which use 5000 images for validation, and 5000 for testing. We further tested on the actual MS COCO test set consisting of 40,775 images (human captions for this split are not available publicly), and evaluated them on the COCO evaluation server.

4.2. Evaluation 

Metrics. We report results with the frequently used BLEU metric and sentence perplexity (PPL). BLEU [41] scores were originally designed for automatic machine translation, where they measure the fraction of n-grams (up to 4-gram) that are in common between a hypothesis and a reference or set of references. Here we compare against 5 references. Perplexity is a standard measure for evaluating language models which measures how many bits on average would be needed to encode each word given the language model, so a low PPL means a better language model. Additionally, we evaluate our model based on the metrics METEOR [4] and CIDEr [49]. All scores (except PPL) are computed with the coco-evaluation code [5].
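The relation between average bits per word and perplexity can be sketched as below. The convention used here (perplexity as 2 raised to the average number of bits) is the standard one and is an assumption about how the reported values were computed; the function names are hypothetical.

```python
import math


def bits_per_word(word_probs):
    """Average number of bits needed to encode each word under the model."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)


def perplexity(word_probs):
    """Perplexity as 2 raised to the average bits per word (assumed convention)."""
    return 2.0 ** bits_per_word(word_probs)
```

A model that assigns probability 0.25 to every word of a sentence needs 2 bits per word and has a perplexity of 4; lower values indicate a better language model.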

Baselines. To verify the effectiveness of our attribute representation, we provide a baseline method. The baseline framework is the same as that proposed in Section 3.2, except that the attributes vector $V_{att}(I)$ is replaced by the last hidden layer of the CNN directly (see the blue arrow in Figure 3). Various CNN architectures are applied in the baseline method to extract image features, such as VggNet [45] and GoogLeNet [48]. For VNet+LSTM, we use the second fully connected layer (fc7), which has 4096 dimensions. In VNet-PCA+LSTM, PCA is applied to decrease the feature dimension from 4096 to 1000. For GNet+LSTM, we use the GoogLeNet model provided in the Caffe Model Zoo [20] and the last average pooling layer is employed, which is a 1024-d vector. VNet+ft+LSTM applies a VggNet that has been fine-tuned on the target dataset, based on the task of image-attributes classification.

Our Approaches. We evaluate several variants of our approach: Att-GT+LSTM models use ground-truth attributes as the input while Att-CNN+LSTM uses the attributes vector $V_{att}(I)$ predicted by the attributes prediction network in Section 3.1. We also evaluate an approach, Att-SVM+LSTM, with an attributes vector predicted by a linear SVM (C = 1). SVM classifiers are trained to separate positive attributes from negative ones given an image-attributes correspondence. We use the second fully connected layer of the fine-tuned VggNet to feed the SVM. To infer the sentence given an input image, we use Beam Search, which iteratively considers the set of $b$ best sentences up to time $t$ as candidates to generate sentences at time $t + 1$, and only keeps the best $b$ results. We set $b$ to 5.
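The beam search procedure described above can be sketched as follows. This is a generic illustration, not the authors' decoder: the callback `next_log_probs`, which maps a partial sentence to log-probabilities for the next word, is a hypothetical stand-in for one step of the LSTM, and the maximum length is an assumed safeguard.

```python
def beam_search(next_log_probs, start, end, beam_size=5, max_len=20):
    """Keep the beam_size best partial sentences at each time step.

    next_log_probs: hypothetical callback, prefix (list of words) ->
        dict mapping each candidate next word to its log-probability.
    Returns the highest-scoring sentence as a list of words.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for word, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [word]))
        # Keep only the best beam_size extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished sentences at max_len
    if not finished:
        return [start]
    return max(finished, key=lambda c: c[0])[1]
```

With `beam_size=1` this reduces to greedy decoding; a wider beam can recover a sentence whose first word is not the single most likely one.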

Results. Table 1 reports image captioning results on COCO. It is not surprising that the Att-GT+LSTM model performs best, since ground truth attributes labels are used. We report these results just to show the advances of adding an intermediate image-to-word mapping stage. Ideally, if we are able to train a strong attributes predictor which gives us a good enough estimation of attributes, we could obtain an outstanding improvement compared with both baselines and state-of-the-art methods. Indeed, apart from using ground truth attributes, our Att-CNN+LSTM models generate the best results over all evaluation metrics. Especially compared with baselines, which do not contain an attributes prediction layer, our final models bring significant improvements, nearly 15% for B-1 and 30% for CIDEr on average. The VNet+ft+LSTM model performs better than other baselines because of the fine-tuning on the target dataset. However, it does not perform as well as our attributes-based models. That Att-SVM+LSTM under-performs Att-CNN+LSTM indicates that our region-based attributes prediction network performs better than the SVM classifier. Our final model also outperforms the current state-of-the-art methods listed in the tables. We also evaluate an approach that combines CNN features and the attributes vector together as the input of the LSTM, but we find this approach (B-1 = 0.71) is not as good as using the attributes vector alone in the same setting. In any case, the above experiments show that an intermediate image-to-words stage (i.e. attributes prediction layer) brings us significant improvements. Results on Flickr8k and Flickr30k can be found in the supplementary material, as well as some qualitative results.

State-of-art        B-1    B-2    B-3    B-4    M      C      PPL
NeuralTalk [22]     0.63   0.45   0.32   0.23   0.20   0.66   -
Mind's Eye [6]      -      -      -      0.19   0.20   -      11.60
NIC [50]            -      -      -      0.28   0.24   0.86   -
LRCN [10]           0.67   0.49   0.35   0.25   -      -      -
Mao et al. [36]     0.67   0.49   0.34   0.24   -      -      13.60
Jia et al. [18]     0.67   0.49   0.36   0.26   0.23   0.81   -
MSR [11]            -      -      -      0.26   0.24   -      18.10
Xu et al. [53]      0.72   0.50   0.36   0.25   0.23   -      -
Jin et al. [21]     0.70   0.52   0.38   0.28   0.24   0.84   -
Baseline-CNN(I)
VNet+LSTM           0.61   0.42   0.28   0.19   0.19   0.56   13.58
VNet-PCA+LSTM       0.62   0.43   0.29   0.19   0.20   0.60   13.02
GNet+LSTM           0.60   0.40   0.26   0.17   0.19   0.55   14.01
VNet+ft+LSTM        0.68   0.50   0.37   0.25   0.22   0.73   13.29
Ours-V_att(I)
Att-GT+LSTM‡        0.80   0.64   0.50   0.40   0.28   1.07   9.60
Att-SVM+LSTM        0.69   0.52   0.38   0.28   0.23   0.82   12.62
Att-CNN+LSTM        0.74   0.56   0.42   0.31   0.26   0.94   10.49

Table 1. BLEU-1,2,3,4, METEOR (M), CIDEr (C) and PPL metrics compared with other state-of-the-art methods and our baseline on the MS COCO dataset. ‡ indicates that ground truth attributes labels are used; this row (in gray) does not participate in rankings.

COCO-TEST      B-1    B-2    B-3    B-4    M      R      CIDEr
5-Refs
Ours           0.73   0.56   0.41   0.31   0.25   0.53   0.92
Human          0.66   0.47   0.32   0.22   0.25   0.48   0.85
MSR [11]       0.70   0.53   0.39   0.29   0.25   0.52   0.91
m-RNN [36]     0.68   0.51   0.37   0.27   0.23   0.50   0.79
LRCN [10]      0.70   0.53   0.38   0.28   0.24   0.52   0.87
40-Refs
Ours           0.89   0.80   0.69   0.58   0.33   0.67   0.93
Human          0.88   0.74   0.63   0.47   0.34   0.63   0.91
MSR [11]       0.88   0.79   0.68   0.57   0.33   0.66   0.93
m-RNN [36]     0.87   0.76   0.64   0.53   0.30   0.64   0.79
LRCN [10]      0.87   0.77   0.65   0.53   0.32   0.66   0.89

Table 2. COCO evaluation server results. M and R stand for METEOR and ROUGE-L. Results using 5 reference captions and 40 reference captions are both shown. We only list the comparison results that have been officially published in the corresponding references.

We further generated captions for the images in the COCO test set containing 40,775 images and evaluated them on the COCO evaluation server. These results are shown in Table 2. We achieve 0.73 on B-1, and surpass human performance on 13 of the 14 metrics reported. We obtain the best results on 3 evaluation metrics (B-1,2,3) on the server leaderboard at the time of writing this paper. We also achieve a top-5 ranking on the other evaluation metrics.

                 Ours   NIC [50]   LRCN [10]   m-RNN [36]   NeuralTalk [22]
VIS Input Dim    256    1000       1000        4096         4096
RNN Dim          256    512        1000x4      256          300-600

Table 3. Visual feature input dimension and properties of the RNN. Our visual features have been encoded as a 256-d attributes score vector while other models need higher dimensional features to feed to the RNN. In terms of the unit size of the RNN, we achieve state-of-the-art results using a relatively small dimensional recurrent layer.

Table 3 summarizes some properties of the recurrent layers
employed in some recent RNN-based methods. We achieve
state-of-the-art results using a relatively low-dimensional visual
input feature and recurrent layer. A lower dimension of the visual
input and RNN normally means fewer parameters in the RNN
training stage, as well as lower computation cost.
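
To make the link between layer dimensions and parameter count concrete, the following is a minimal sketch rather than a description of any of the listed models. It assumes a single standard LSTM layer (four gates, no peephole connections) whose input size equals the "VIS Input Dim" entry and whose hidden size equals the "RNN Dim" entry of Table 3. Real systems insert embedding or projection layers, so the numbers are only indicative. The function name lstm_param_count is hypothetical.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Weights and biases of one standard LSTM layer.

    Assumption: four gates (input, forget, output, candidate), each with an
    input weight matrix, a recurrent weight matrix and one bias vector.
    """
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return 4 * per_gate


# (input dim, hidden dim) pairs read off Table 3; treating them as the sizes
# of a single LSTM layer is an illustrative simplification.
configs = {"Ours": (256, 256), "NIC": (1000, 512)}

for name, (d, h) in configs.items():
    print(name, lstm_param_count(d, h))
```

Under these assumptions the 256/256 configuration needs roughly six times fewer recurrent-layer parameters than the 1000/512 one.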

5. Visual Question Answering 

5.1. Dataset 

We report VQA results on two recently released, publicly
available visual question answering datasets, both built
on MS COCO. The Toronto COCO-QA dataset [43] contains
four types of questions: object, number, color and location.
The answers are all single words.
We use this dataset to examine our single-word question
answering model. VQA [2] is a much larger dataset which
contains 614,163 questions. These questions and answers
are sentence-based and open-ended. The training and testing
split follows the official COCO split, which contains 82,783
training images, 40,504 validation images and 81,434 test
images; each image has 3 questions, and each question has
10 answers. We use the official test split for our testing.

5.2. Evaluation 

Our experiments in question answering are designed to
verify the effectiveness of introducing the intermediate
attribute layer. Hence, apart from listing several state-of-the-art
methods, we focus on comparing with a baseline method,
which uses only the second fully connected layer (fc7) of
the VggNet (and a fine-tuned VggNet) as the input.

Table 4 reports results on the Toronto COCO-QA
dataset, in which all answers are single words. Besides
the accuracy (the proportion of correctly answered test
questions among all test questions), the Wu-Palmer
similarity (WUPS) [52] is also used to measure the
performance of different models. WUPS calculates the
similarity between two words based on their common
subsequence in the taxonomy tree. If
the similarity between two words is greater than a threshold,
then the candidate answer is assumed to be right. We
follow [32, 43] in setting the threshold to 0.9 and 0.0. GUESS
is a simple baseline that predicts the most common answer
from the training set based on the question type. The modes
are ‘cat’, ‘two’, ‘white’, and ‘room’ for the four types of
questions. VIS+BOW [43] performs multinomial logistic
regression based on image features and a BOW vector
obtained by summing all the word vectors of the question.
VIS+LSTM [43] has one LSTM to encode the image and
question, while 2-VIS+BLSTM has two image feature
input points, at the start and the end of the sentence. Ma et
al. [32] encode both images and questions with a CNN. From
Table 4, we clearly see that our attribute-based model
outperforms the baselines and all state-of-the-art methods
by a significant margin, which demonstrates the effectiveness of
our attribute-based representation for V2L tasks.

Toronto COCO-QA      Acc     WUPS@0.9   WUPS@0.0
GUESS [43]           6.65    17.42      73.44
VIS+BOW [43]         55.92   66.78      88.99
VIS+LSTM [43]        53.31   63.91      88.25
2-VIS+BLSTM [43]     55.09   65.34      88.64
Ma et al. [32]       54.94   65.36      88.58
Baseline
VggNet+LSTM          50.73   60.37      87.48
VggNet+ft+LSTM       58.34   67.32      89.13
Our proposal
Att-GT+LSTM*         67.66   75.76      93.63
Att-CNN+LSTM         61.38   71.15      91.58

Table 4. Accuracy, WUPS@0.9 and WUPS@0.0 metrics compared
with other state-of-the-art methods and our baseline on the
Toronto COCO-QA dataset. Each image has one question and
only a single word answer is given for each. *: indicates that ground
truth attribute labels were used, and thus that the method does not
participate in rankings.
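
The WUPS measure described above can be illustrated with a small sketch. It is not the evaluation code used for Table 4. The taxonomy below is a hypothetical toy tree rather than WordNet, and the function names are illustrative. The text only states that answers are thresholded; the sketch additionally assumes the common WUPS convention that a similarity below the threshold is scaled by 0.1 instead of being discarded, which is an assumption not stated in this paper.

```python
# Toy taxonomy (child -> parent). Hypothetical data, not WordNet.
PARENT = {
    "animal": "entity",
    "furniture": "entity",
    "cat": "animal",
    "dog": "animal",
    "bed": "furniture",
}


def path_to_root(word):
    """Return [word, parent, ..., root]."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)).

    Depth counts nodes from the root, so the root has depth 1.
    """
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # Walking upwards from a, the first shared node is the deepest common one.
    lcs = next((n for n in path_a if n in ancestors_b), None)
    if lcs is None:
        return 0.0
    return 2.0 * len(path_to_root(lcs)) / (len(path_a) + len(path_b))


def wups(predictions, truths, threshold, penalty=0.1):
    """Mean thresholded similarity (in percent) for single-word answers.

    Assumption: scores below the threshold are multiplied by `penalty`.
    """
    total = 0.0
    for p, t in zip(predictions, truths):
        s = wup(p, t)
        total += s if s >= threshold else penalty * s
    return 100.0 * total / len(truths)


print(wups(["cat", "dog"], ["cat", "cat"], threshold=0.9))
print(wups(["cat", "dog"], ["cat", "cat"], threshold=0.0))
```

With a threshold of 0.9 only near-synonyms receive full credit, while a threshold of 0.0 reduces the score to the plain average similarity.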

Table 5 summarizes the results on the test split of the VQA
dataset. In contrast to the above single-word question
answering task, here we follow [2] and measure performance
by recording the percentage of answers in agreement with
ground truth from human subjects. Antol et al. [2]
provide a baseline for this dataset using a Q+I method, which
encodes the image with CNN features and the question with an
LSTM representation. They then train a softmax neural
network classifier with a single hidden layer, whose
output space is the 1000 most frequent answers in the
training set. Human performance is also given in [2] for
reference. VNet+ft+LSTM is the model with fine-tuned
VggNet features. It is slightly less accurate than our
explicit attributes-based model Att-CNN+LSTM, but the gap
is small. LSTM Q+I [2] can be treated as our baseline,
as it uses CNN features as the input to the LSTM, while
LSTM Q only provides questions as the input. Our
attributes-based model outperforms LSTM Q+I in nearly all
cases, especially when the answer type is ‘others’. Our
hypothesis is that this performance increase occurs because
the separately-trained attribute layer discards irrelevant
image information. This ensures that the LSTM does not
interpret irrelevant variations in the expression of the text as
relating to irrelevant image details, and try to learn a
mapping between them.

                 Test-dev                          Test-standard
                 All     Y/N     Num     Others    All     Y/N     Num     Others
Q+I [2]          52.64   75.55   33.67   37.37     -       -       -       -
LSTM Q [2]       48.76   78.20   35.68   26.59     48.89   78.12   34.94   26.99
LSTM Q+I [2]     53.74   78.94   35.24   36.42     54.06   79.01   35.55   36.80
Human [2]        -       -       -       -         83.30   95.77   83.39   72.67
VNet+ft+LSTM     55.03   78.19   35.47   39.68     55.34   78.10   35.30   40.27
Att-CNN+LSTM     55.57   78.90   36.11   40.07     55.84   78.73   36.08   40.60
Att-KB+LSTM      57.46   79.77   36.79   43.10     57.62   79.72   36.04   43.44

Table 5. Results on the test-dev and test-standard splits of the VQA
dataset compared with [2].
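
The agreement-based accuracy used for the VQA results can be sketched as follows. The text only says that answers are scored by their agreement with human ground truth. The specific rule below (an answer counts as fully correct once at least 3 of the 10 annotators gave it) is the convention of the VQA benchmark [2] as we understand it, and should be read as an assumption rather than as part of this paper. The function names are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against the human answers for a question.

    Assumption: accuracy = min(number of matching humans / 3, 1).
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions, all_human_answers):
    """Average per-question accuracy over a dataset, in percent."""
    scores = [vqa_accuracy(p, h) for p, h in zip(predictions, all_human_answers)]
    return 100.0 * sum(scores) / len(scores)


# Illustrative data: 10 human answers for a single question.
humans = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("yes", humans))
print(vqa_accuracy("no", humans))
```

Under this rule an answer given by only a minority of annotators still earns partial credit, which softens the penalty for questions with several acceptable answers.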

However, there is still a big gap between our proposed
models and human performance. Looking into the
details, we notice that accuracies on some question types, such
as ‘why’, are very low. These kinds of questions are hard to
answer because commonsense knowledge and reasoning are
normally required. Zhu et al. [59] cast an MRF model into
a Knowledge Base representation to answer commonsense-related
visual questions. Our semantic attribute representation
offers hope of a solution, however, as it can be used
as a key by which to source other, external information.
In the following experiment, we propose to expand our
image-based attribute set to a knowledge-based attribute
set through a large lexical ontology, WordNet.

5.3. Attribute Expansion using WordNet 

WordNet [38] records a variety of relationships between
words, some of which we hope to use to address the many
ways of expressing the same idea in natural language. The
most frequently encoded relation is hyponymy (such as
bed and bunkbed). Meronymy represents the part-whole
relation. Verb synsets are arranged into hierarchies
(troponyms) (such as buy-pay). All these relationships are
defined based on commonsense knowledge.

To expand our image-sourced attributes to knowledge-sourced
information, we first select candidate words from
WordNet. Candidate words must fulfill two selection criteria.
The first is that the word must be directly linked with an
arbitrary word in our attribute vocabulary $V_{att}$ through
WordNet. Secondly, the candidate word must appear in at
least 5 training question examples. In our experiment, given
$M = 256$ image-sourced attributes, we finally mined a
knowledge-sourced vocabulary $V_{kb}$ with $N = 9762$ words,
and $V_{kb}$ covers all the words in $V_{att}$. Then, a
similarity matrix $S \in \mathbb{R}^{M \times N}$ is computed based on a
pre-trained word2vec model [37], where $S_{ij}$ gives both semantic
and syntactic similarity between word $i$ in $V_{att}$ and word
$j$ in $V_{kb}$. Given an image $I$ and its image-sourced
attribute vector $V_{att}(I) = (v_{att}^{(1)}, \ldots, v_{att}^{(M)})$, predicted
by the attribute prediction network, the $j$-th component of
the knowledge-sourced attribute vector is obtained by a
max-pooling operator $v_{kb}^{(j)} = \max\{v_j^{(1)}, \ldots, v_j^{(M)}\}$,
where $v_j^{(i)} = v_{att}^{(i)} \times S_{ij}$. The final knowledge-sourced
attributes vector $V_{kb}(I) = (v_{kb}^{(1)}, \ldots, v_{kb}^{(N)})$ will be fed
into the LSTM to generate answers.

Question-Type   Vgg+LSTM   Att-CNN+LSTM   Att-KB+LSTM
why             3.04       7.77           9.88
what kind       24.15      41.22          45.23
which           31.28      36.60          37.28
is the          71.49      73.22          74.59
is this         73.00      75.26          76.63

Table 6. Results on the open-answer task for some commonsense
reasoning question types on the validation split of VQA.
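
The attribute expansion step amounts to a max-pooling over similarity-weighted attribute scores. The following is a minimal sketch under stated assumptions, not the authors' implementation. The similarity matrix is given directly as a small hand-written list (in the paper it comes from a pre-trained word2vec model), the sizes M = 2 and N = 3 are illustrative, and the function name expand_attributes is hypothetical.

```python
def expand_attributes(v_att, S):
    """Map M image-sourced scores to N knowledge-sourced scores.

    v_att: list of M attribute scores for one image.
    S:     M x N similarity matrix given as a list of M rows.
    Component j of the result is max over i of v_att[i] * S[i][j].
    """
    M = len(v_att)
    N = len(S[0])
    if any(len(row) != N for row in S) or len(S) != M:
        raise ValueError("S must have shape M x N with M == len(v_att)")
    return [max(v_att[i] * S[i][j] for i in range(M)) for j in range(N)]


# Illustrative numbers only: 2 image attributes, 3 knowledge-base words.
v_att = [0.9, 0.2]
S = [
    [1.0, 0.5, 0.1],  # similarities of attribute 0 to the 3 KB words
    [0.0, 0.4, 0.8],  # similarities of attribute 1 to the 3 KB words
]
print(expand_attributes(v_att, S))
```

Each knowledge-sourced word therefore inherits its score from the single most compatible image attribute, so a strongly detected attribute can activate related words that were never predicted directly.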

Table 6 compares results using image-sourced attributes
vs. knowledge-sourced attributes on the validation split of the VQA
dataset. We gain a significant improvement on commonsense
reasoning related questions. For example, on the
‘why’ questions, we achieve 9.88%, compared with 7.77% for
Att-CNN+LSTM. Our hypothesis is that
this reflects the fact that indexing into WordNet in this manner
provides some independence as to the exact manner of
expression used in the text, but also adds extra information.
In answering questions about beds and hammocks, for
example, it is useful to know that both are related to sleep.
The overall performance of this Att-KB+LSTM model on
the test split of VQA can be found in Table 5. Our overall
result is 57.62% accuracy, which is better than that of
Att-CNN+LSTM (the model before attribute expansion)
and achieves the state-of-the-art result on the VQA
dataset.

6. Conclusion 

We have described an investigation into the value of high
level concepts in V2L problems, motivated by the belief that
without an explicit representation of the content of an image
it is very difficult to reason about it. In the process
we examined the effect of introducing an intermediate
attribute prediction layer into the predominant CNN-LSTM
framework. We implemented three attribute-based models
for the tasks of image captioning, single-word question
answering and sentence question answering.

We have shown that an explicit representation of image
content improves V2L performance in all cases. Indeed,
at the time of writing this paper, our image captioning
model outperforms the state of the art on several captioning
datasets. Our question answering models perform best on
the Toronto COCO-QA dataset, producing an accuracy of
61.38%. They also achieve the state of the art on VQA,
at 57.62%, which is a big improvement over the baseline.
Moreover, the attribute representation enables access to
high-level commonsense knowledge, which is necessary for
answering commonsense reasoning related questions.